Abstract

This  work  investigates  a  combined  stochastic  and  deterministic 
optimization  approach  for  multivariate  mixture  density  estimation. 
Mixture  probability  density  models  are  selected  and  optimized  by 
combining  the  optimization  characteristics  of  a  multiagent  stochastic 
optimization  algorithm  based  on  evolutionary  programming  and  the 
expectation-maximization  algorithm.  Unlike  the  traditional  finite 
mixture  model,  generally  composed  of  a  sum  of  normal  component 
densities,  the  generalized  mixture  model  is  composed  of  shape-adaptive 
components.  Rissanen's  minimum  description  length  criterion  provides 
the  selection  mechanism  for  evaluating  mixture  model  fitness.  The 
classification  problem  is  approached  by  optimizing  a  mixture  density 
estimate  for  each  class.  A  comparison  of  each  class's  posterior 
probability  (Bayes  rule)  provides  the  classification  decision  procedure. 
A  classification  problem  is  posed,  and  the  classification  performance  of 
the  derived  generalized  mixture  models  is  compared  with  the 
performance  of  mixture  models  generated  using  normally  distributed 
components.  While  both  approaches  produced  excellent  classification 
results,  the  generalized  mixture  approach  produced  more  parsimonious 
density  models  from  the  training  data. 


1  INTRODUCTION 

The  area  of  nonparametric  density  estimation  is  proving  to  be  an 
increasingly  useful  tool  in  providing  a  mathematical  approach  for  the 
characterization  and  classification  of  complicated  data.  Several 
methods  of  nonparametric  density  estimation  have  been  proposed, 
including  (but  not  limited  to)  kernel  estimators  (Parzen  1962;  Silverman 
1986),  maximum  penalized  likelihood  estimators  (Good  and  Gaskins 
1971;  Silverman  1982),  and  the  method  of  mixtures  (McLachlan  1986). 
Classification  systems  based  on  some  neural  network  models,  such  as 
radial  basis  functions  (Moody  and  Darken  1989)  and  the  probabilistic 
neural  network  (Specht  1990),  are  mathematically  and  functionally 
equivalent  to  mixture  models  and  kernel  estimators,  respectively.  These 
neural network models therefore share the same characteristics and
limitations as their statistical analogues. The kernel estimator is
simple to derive from the data, but requires the entire training sample
for probabilistic inference. When compared to mixture models, kernel
estimators have larger storage requirements and longer run-time
execution. Finite mixture models provide data reduction and
generalization, thus reducing the storage requirements and improving
execution speed, but are computationally more expensive to derive.

This work investigates combining stochastic search with the
method  of  maximum  likelihood  for  the  optimization  of  mixture  density 
estimates,  where  the  number  of  components,  the  functional  shape  of  each 
component,  and  the  component  parameters  are  simultaneously 
optimized.  In  this  paper,  these  shape-adaptive  mixture  models  are 
called  generalized  mixtures.  The  classification  performance  of 
generalized  mixture  densities  is  computed  and  compared  to  normal 
component  mixtures  for  a  two-class  classification  problem. 

The  following  subsections  introduce  the  mixture  method  approach  for 
probability  density  estimation.  The  expectation-maximization  (EM) 
algorithm for parameter optimization is described, and its relationship
to finite mixture parameter estimation is explained.
Measures  of  model  fitness  and  complexity  are  also  discussed.  Section  2 
discusses  stochastic  approaches  to  optimization,  including  evolutionary 
programming.  Section  3  discusses  the  formulation  of  the  generalized 
mixture’s  components,  and  the  combining  of  the  EM  and  stochastic 
approaches  for  model  order  and  parameter  optimization.  Section  4 
describes  a  two-dimensional  density  estimation  problem,  and  the 
performance  characteristics  of  the  stochastic-EM  optimization  process. 
Classification  error  rates  of  the  evolved  optimal  generalized  mixtures 
are  computed  and  compared  to  normal-component  mixture  models  in 
section  5.  Conclusions  are  offered  in  section  6. 

Finite  mixture  methods 

A  finite  mixture  distribution  is  defined  informally  as  a  distribution  that 
decomposes  into  two  or  more  proportionally  scaled  probability 
distributions.  Mathematically,  a  mixture  probability  density  function 

$f$, composed of $q$ probability distributions $f_1, \ldots, f_q$, is defined as the
following:

$$f(x \mid \phi) = \sum_{i=1}^{q} \alpha_i f_i(x \mid \theta_i) \qquad (1)$$

where $\phi$ is the vector of free parameters $\phi = [\alpha, \theta]^T$. The proportions
$\alpha_1, \ldots, \alpha_q$ denote the relative contributions made by their respective
density components. Their values are constrained by the following:

$$\sum_{i=1}^{q} \alpha_i = 1 \quad \text{and} \quad \alpha_i > 0 \quad (i = 1, \ldots, q). \qquad (2)$$
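The mixture definition and its constraints on the proportions can be illustrated with a minimal sketch. It assumes univariate normal components purely for illustration; the function names `normal_pdf` and `mixture_pdf` are hypothetical and do not come from the original work.

```python
import math

def normal_pdf(x, mean, var):
    """Univariate normal density, used here only as an example component."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def mixture_pdf(x, proportions, components):
    """Evaluate a finite mixture: the proportion-weighted sum of component densities.

    proportions: positive weights that must sum to one.
    components:  callables, each mapping x to a component density value.
    """
    if any(a <= 0.0 for a in proportions):
        raise ValueError("every proportion must be strictly positive")
    if abs(sum(proportions) - 1.0) > 1e-9:
        raise ValueError("proportions must sum to one")
    return sum(a * f(x) for a, f in zip(proportions, components))

# Two-component example: 30% N(0, 1) and 70% N(2, 1).
example_components = [lambda x: normal_pdf(x, 0.0, 1.0),
                      lambda x: normal_pdf(x, 2.0, 1.0)]
example_value = mixture_pdf(0.0, [0.3, 0.7], example_components)
```

Passing the components as callables keeps the sketch independent of the component family, so the same function would also accept shape-adaptive components.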


A  goal  of  mixture  model  density  estimation  is  to  produce  density 
estimates  where  the  number  of  mixture  components  q  is  much  smaller 
than  the  sample  size  n.  Mixture  models  therefore  attempt  to  provide 
some  of  the  computational  efficiency  associated  with  parametric 
density  estimation,  while  minimizing  the  number  of  assumptions 
concerning  the  true  underlying  distribution. 

Unfortunately,  analytical  optimization  of  a  finite  mixture  is 
complicated  even  for  a  moderate  sample  size  (Choi  and  Bulgren  1968).  A 
nonanalytical  solution  for  mixture  model  optimization  is  provided  by 
the  EM  algorithm. 

Expectation-maximization algorithm for finite mixture optimization

An  important  generalization  of  the  method  of  maximum  likelihood  was 
developed  by  Dempster  et  al.  (1977),  in  which  an  iterative  procedure 
was  introduced  that  allows  maximum  likelihood  estimates  to  be 
generated  from  incomplete  data.  Here  "incomplete"  is  used  in  the  sense 
that  a  component  of  the  data  is  occasionally  (or  always)  missing,  and 
therefore  some  data  does  not  provide  values  for  the  variables  under 
consideration. 

Suppose it is desired to find the maximum likelihood estimate of $\theta$
for the likelihood function $L(\theta) = g(x \mid \theta)$, where $x$ is a set of
"incomplete" data. Let $y$ be a complete version of the data, and let the
likelihood of $y$ be denoted as $f(y \mid \theta)$. From an initial approximation
$\theta^{(0)}$, the EM algorithm generates an iterative sequence of estimates $\theta^{(k)}$
via the following two steps:

E step: Compute $Q(\theta \mid \theta^{(k)}) = E[\log f(y \mid \theta) \mid x, \theta^{(k)}]$ (3)

M step: Set $\theta^{(k+1)} = \arg\max_{\theta} Q(\theta \mid \theta^{(k)})$ (4)

In  solving  a  finite  mixture  problem,  where  the  number  of  components  is 
determined  a  priori,  Redner  and  Walker  (1984)  give  a  formulation  for 
the general EM approach. For a given vector of parameters $\phi^c$,
which is the current approximate maximum likelihood estimate of the
log-likelihood function $\log L(\phi \mid x)$, their update equations (5) and (6)
produce the next approximate maximum likelihood estimate $\phi^+$
of the log-likelihood function.

Figure 10. Generalized mixture model (a) and traditional normal mixture (b)
decision surfaces for the two-class Flick data problem.


Note that the term $\alpha_i^c f_i(x_j \mid \theta_i^c) / f(x_j \mid \phi^c)$ appearing in
these updates is the posterior probability that $x_j$ originated from the
$i$th component population, given the current maximum likelihood
estimate $\phi^c$.

At each iteration, the EM algorithm guarantees that
$\log L(\phi^+ \mid x) \ge \log L(\phi^c \mid x)$. Given an a priori selection of the number of
mixture components, this approach iteratively determines the
proportion and parameters of each component. An
information criterion relating density estimate fitness and model
complexity is required for comparison of mixture density estimates with
a varying number of components.
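One EM iteration for a mixture with a fixed number of components can be sketched as follows. This is only an illustration under stated assumptions: the components are univariate normals, and the closed-form M step is the standard one for that case rather than a transcription of this paper's equations. The names `em_step` and `log_likelihood` are hypothetical.

```python
import math

def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def log_likelihood(data, props, means, variances):
    """Log-likelihood of the data under the current mixture."""
    return sum(
        math.log(sum(a * normal_pdf(x, m, v) for a, m, v in zip(props, means, variances)))
        for x in data
    )

def em_step(data, props, means, variances, var_floor=1e-9):
    """One EM iteration for a univariate normal mixture (illustrative assumption)."""
    q = len(props)
    # E step: posterior probability that each point came from each component.
    resp = []
    for x in data:
        weighted = [props[i] * normal_pdf(x, means[i], variances[i]) for i in range(q)]
        total = sum(weighted)
        resp.append([w / total for w in weighted])
    # M step: re-estimate proportions, means, and variances from the posteriors.
    n = len(data)
    new_props, new_means, new_vars = [], [], []
    for i in range(q):
        mass = sum(r[i] for r in resp)
        mean = sum(r[i] * x for r, x in zip(resp, data)) / mass
        var = sum(r[i] * (x - mean) ** 2 for r, x in zip(resp, data)) / mass
        new_props.append(mass / n)
        new_means.append(mean)
        new_vars.append(max(var, var_floor))  # guard against a collapsing component
    return new_props, new_means, new_vars
```

Repeated calls to `em_step` should not decrease `log_likelihood`, which is the monotonicity property stated above; the variance floor is a practical safeguard added for this sketch.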

Mixture  model  order  identification 

Models  are  generally  compared  by  their  complexity  and  how  well 
they  fit  the  data,  with  the  goal  of  maximizing  model  fitness  while 
minimizing  model  complexity.  These  constraints  generally  are 
diametrically  opposed,  i.e.,  increasing  model  complexity  will  generally 
allow  for  increasing  the  model  fitness.  To  measure  how  well  the  models 
fit  the  data,  a  function  incorporating  the  likelihood  function  is 
generally  used.  The  likelihood  function  has  several  desirable  properties 
which  make  it  appropriate  for  measuring  the  relationship  between 
model  and  data.  Cramer  (1946)  demonstrates  that  under  certain 
regularity  conditions,  the  maximum  likelihood  estimators  (MLE)  of  a 
multivariate model have solutions which are asymptotically normal
and jointly asymptotically efficient estimates of the parameters.
Therefore, asymptotically, the maximum likelihood estimates of a set of
parameters have the smallest variance about the true parameter values among all
unbiased estimators. The MLE thus provides a very sensitive measure of
fitness between model and data.

Many  criteria  of  model  complexity  determination  have  been 
developed  from  the  concept  of  maximizing  the  likelihood  function 
while  penalizing  the  number  of  free  parameters  required.  These  criteria 
include  the  generalized  likelihood  ratio  test  (Casella  and  Berger  1990), 
Akaike's  information  criterion  (Akaike  1974),  and  the  minimum 
description  length  (MDL)  criterion  (Rissanen  1986).  The  MDL  criterion  is 
formulated  as 


$$MDL(x) = \min \left\{ -\log L(\theta \mid x) + \frac{k}{2} \log n \right\} \qquad (7)$$

where $L(\theta \mid x)$ is the likelihood function of the model, $k$ is the number of
the free parameters used to represent the data, and $n$ is the cardinality
of $x$. It is interesting to note that Schwarz's criterion (Schwarz 1978), a
Bayesian derived criterion for model selection, is identical to the MDL
criterion. Given a data sample $x$, a density estimate model $M_i$ is
"better" than another model $M_j$ if $MDL(x \mid M_i) < MDL(x \mid M_j)$. In the
present work, the MDL criterion provides the likelihood-based measure
of model fitness in the mixture model optimization process.
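As a worked illustration of the MDL score, the sketch below evaluates the penalized negative log-likelihood for an already-optimized model. It assumes the natural logarithm; with the class 2 values reported in Table 2 (200 training points), that choice reproduces the tabulated scores to within rounding. The function names are hypothetical.

```python
import math

def mdl_score(log_likelihood, num_free_params, sample_size):
    """Negative log-likelihood plus (k / 2) * log(n); lower scores are better."""
    return -log_likelihood + 0.5 * num_free_params * math.log(sample_size)

def best_model_index(scores):
    """Index of the model with the smallest MDL score."""
    return min(range(len(scores)), key=lambda i: scores[i])

# Class 2 values from Table 2, with n = 200 training points.
generalized_class2 = mdl_score(95.274, 13, 200)   # roughly -60.83
normal_class2 = mdl_score(94.270, 17, 200)        # roughly -49.23
preferred = best_model_index([generalized_class2, normal_class2])
```

Here `preferred` selects the generalized mixture, since it attains a higher log-likelihood with fewer free parameters.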


2  STOCHASTIC  OPTIMIZATION 

Random  (or  stochastic)  search  techniques  have  been  used  for  function 
optimization  since  the  1950s.  Stochastic  search  strategies  are 
competitive  with  or  superior  to  traditional  search  strategies  (such  as 
gradient  search  techniques)  when  the  cost  or  objective  function  under 
optimization  is  difficult  to  compute,  or  when  the  function  to  be 
minimized  has  many  suboptimal  solutions  (local  minima).  Other 
advantages,  enumerated  by  Karnopp  (1963),  include  the  ease  of 
programming,  inexpensive  realization  of  possible  solutions,  as  well  as 
flexibility  in  the  expression  of  the  criterion  function. 

Stochastic  optimization  techniques  are  based  on  either  single  point  or 
multiple  agent  algorithms.  Single  point  algorithms  include  the  random 
walk,  the  creeping  random  method  (Brooks  1958),  and  the  method  of 
Solis  and  Wets  (1981).  Multiple  agent  stochastic  search  algorithms, 
such  as  genetic  algorithms  (Goldberg  1989),  evolution  strategies  (Back 
and  Schwefel  1993),  and  evolutionary  programming  (Fogel  1991),  are 
becoming  well  known  for  their  optimization  properties. 

Figure  1  is  a  graphical  representation  of  an  evolutionary  programming 
(EP)  algorithm.  A  population  of  models  (solutions)  generates  new 
models  (offspring)  via  mutation,  and  then  the  complete  population 
competes  for  survival  to  the  next  iteration.  In  EP,  mutation  consists  of 
random  perturbations  of  the  parameters,  with  the  magnitude  of  the 
perturbation  generally  self-adaptive  or  tied  to  the  fitness  of  the  parent. 
For  the  process  of  real-valued  parameter  optimization,  the  mutation 
process  generally  consists  of  perturbing  a  parameter  value  with  a  normal 
or  lognormally  distributed  random  variable.  The  normal  perturbation 
scheme  is  given  as  the  following: 

$$\theta' = \theta + N(0, f \cdot s + z) \qquad (8)$$

where $f$ is the measure of error of the parent model, and $z$ and $s$ are an
offset and scale factor, respectively. The random nature of this
mutation, although useful in escaping local minima, is inefficient for
direct minimization. Also, the use of the error value $f$ in the variance
term $f \cdot s + z$ allows the variance to shrink as the error decreases (model
fitness improves), but is properly defined only when $f \cdot s + z > 0$. Several
approaches  have  been  developed  to  avoid  any  association  of  the  fitness 
value  with  the  mutation  process,  and  to  speed  the  convergence  to 
optimal  solution  (Waagen  et  al.  1992;  McDonnell  and  Waagen  1994).  The 
next  section  describes  the  hybrid  approach  developed  in  this  work  for 
mixture  model  selection  and  optimization. 
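The fitness-scaled normal perturbation described above can be sketched as follows. This is a minimal illustration, not the authors' implementation; `ep_mutate` and its argument names are hypothetical, and the second argument of the normal distribution is treated as a variance.

```python
import math
import random

def ep_mutate(params, error, scale, offset, rng=random):
    """Perturb every parameter with zero-mean normal noise of variance error * scale + offset."""
    variance = error * scale + offset
    if variance <= 0.0:
        # The scheme is only defined for a positive variance term.
        raise ValueError("error * scale + offset must be positive")
    std_dev = math.sqrt(variance)
    return [p + rng.gauss(0.0, std_dev) for p in params]

# Example: a parent with error 0.5 produces one perturbed offspring.
child = ep_mutate([1.0, 2.0], error=0.5, scale=0.1, offset=0.01, rng=random.Random(0))
```

The explicit check matters when the error measure can be negative, as is the case for the MDL scores reported in Table 2, which is one practical reason to decouple the mutation variance from the fitness value.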


Figure 1. Evolutionary programming optimization algorithm. The mutation and
competition  steps  are  stochastic  in  nature. 


3  STOCHASTIC-DETERMINISTIC  MIXTURE  DENSITY 
ESTIMATION 

This  section  details  the  process  of  determining  a  multivariate  mixture 
model  from  a  set  of  independent,  identically  distributed  data.  Preceding 
the  discussion  of  the  optimization  process,  a  discussion  of  the 
distributional  form  of  the  components  which  make  up  the  mixture  is  in 
order. 

Generalized  mixture  component  formulation 

The  functional  form  of  the  mixture  component  distribution  is  based  on  a 
generalized  kernel  function  described  by  Fukunaga  (1990).  The  functional 

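form of the generalized mixture components, denoted as $f(x \mid \mu, \Sigma, m)$, is
given as

$$f(x \mid \mu, \Sigma, m) = \frac{m\,\Gamma\left(\frac{n}{2}\right)}{(n\pi)^{n/2}\,\Gamma^{n/2+1}\left(\frac{n}{2m}\right)}\,\Gamma^{n/2}\left(\frac{n+2}{2m}\right) |\Sigma|^{-1/2} \exp\left\{ -\left[ \frac{\Gamma\left(\frac{n+2}{2m}\right)}{n\,\Gamma\left(\frac{n}{2m}\right)} (x-\mu)^T \Sigma^{-1} (x-\mu) \right]^{m} \right\} \qquad (9)$$

where $n$ is the dimensionality of the data, $\mu$ is the mean, $\Sigma$ is the
covariance matrix, and $m$ is the shape parameter of the mixture
component. The probability density function of the $q$-component mixture
distribution is therefore given as

$$f(x) = \sum_{i=1}^{q} \alpha_i f(x \mid \mu_i, \Sigma_i, m_i) \qquad (10)$$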

The shape parameter $m$ allows the function to include both the
multivariate normal and multivariate uniform distributions as special
cases (the normal at $m = 1$ and the uniform in the limit of large $m$).
A graph of the univariate distribution of $f(x \mid 0, 1, m)$ for two values
of $m$ is given in Figure 2.

Figure 2. Distribution of $f(x \mid 0, 1, m)$ from Eq. 9 for $m$ = 1, 100. Special cases of
Eq. 9 include the multivariate uniform and normal distributions.
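A shape-adaptive component of this kind can be evaluated numerically as sketched below. The sketch assumes the Fukunaga-style generalized Gaussian form, in which the scaled Mahalanobis distance is raised to the power $m$ and the normalizing constant is built from gamma functions so that $\Sigma$ remains the covariance matrix. The function name and the overflow guard are illustrative choices, not taken from the original work.

```python
import math
import numpy as np

def generalized_component_pdf(x, mean, cov, m):
    """Shape-adaptive density: m = 1 gives the normal, large m approaches a uniform."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    mean = np.atleast_1d(np.asarray(mean, dtype=float))
    cov = np.atleast_2d(np.asarray(cov, dtype=float))
    n = x.size
    diff = x - mean
    quad = float(diff @ np.linalg.solve(cov, diff))  # Mahalanobis distance squared
    lg_a = math.lgamma(n / (2.0 * m))                # log Gamma(n / 2m)
    lg_b = math.lgamma((n + 2.0) / (2.0 * m))        # log Gamma((n + 2) / 2m)
    scale = math.exp(lg_b - lg_a) / n
    log_norm = (math.log(m) + math.lgamma(n / 2.0)
                - (n / 2.0) * math.log(n * math.pi)
                - (n / 2.0 + 1.0) * lg_a + (n / 2.0) * lg_b
                - 0.5 * math.log(np.linalg.det(cov)))
    base = scale * quad
    if base == 0.0:
        exponent = 0.0
    else:
        log_exponent = m * math.log(base)
        if log_exponent > 700.0:   # exp would overflow; the density is numerically zero
            return 0.0
        exponent = math.exp(log_exponent)
    return math.exp(log_norm - exponent)
```

Working with log-gamma values keeps the constant stable for large shape parameters such as $m = 100$, where the individual gamma terms become large.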

Hybrid  multiagent  approach  to  optimization 

The  multiagent  optimization  process  investigated  in  this  work  combines 
a  stochastic  mutation  process  with  the  deterministic  EM  algorithm  for 
component  number  and  parameter  optimization.  A  population  of  N 
possible  solutions  (mixture  models)  is  generated  and  maintained 
throughout  the  optimization  process.  Optimization  consists  of  two  steps, 
as  graphically  shown  in  Figure  3.  Optimization  ends  either  after  a  fixed 
number  of  iterations  or  after  the  best  mixture’s  error  value  remains 
constant  for  a  significant  number  of  iterations. 

One  issue  with  the  combined  stochastic-deterministic  optimization 
approach  is  the  fact  that  the  parameters  of  a  mutated  offspring  are  not 
optimized  for  its  component  or  shape  representation,  whereas  the 
parameter values of the offspring's parent have been optimized (to
some  extent)  for  the  mixture’s  shape  and  component  number.  Therefore 
the  offspring  models  are  initially  at  a  disadvantage  to  the  parent 
models  in  the  competition  process.  An  attempt  to  alleviate  this  problem 
is  made  by  allowing  each  offspring  model's  parameters  to  be  optimized 
via  the  EM  algorithm  for  several  iterations  before  model  selection 
occurs.  This  optimization  might  be  biologically  analogous  to  the 
environmental  learning  phase  of  childhood  and  adolescence,  but  this 
paper  will  not  try  to  justify  or  defend  this  analogy  as  truth. 


Figure  3.  Multiagent  mixture  optimization  procedure.  The  stochastic  and 
deterministic  portions  of  the  approach  are  correspondingly  labeled.  Note  that 
the  competition  is  deterministic,  and  the  best  N  models  are  always  selected  for 
survival. 

The  first  step  is  mutation,  consisting  of  the  generation  of  three  mixture 
models from each surviving parent model. The goal of mutation is to
optimize the number of components in the mixture and the shape
parameters of each component. Three new models (offspring) are
produced  via  the  following  rules: 

Offspring 1: Add or remove a randomly selected
component.

Offspring 2: Modify the shape parameter of a randomly
selected component via the positive value of a random
normal perturbation:

$$m' = \begin{cases} m + |N(0,c)| & \text{if } m + |N(0,c)| < M^{*} \\ M^{*} & \text{otherwise} \end{cases} \qquad (11)$$

where $M^{*}$ is an arbitrary constant (200 in this
investigation).

Offspring 3: Modify the shape parameter of the same
selected component (as offspring 2) via the negative
value of the same (as offspring 2) random normal
perturbation:

$$m' = \begin{cases} m - |N(0,c)| & \text{if } m - |N(0,c)| > 1 \\ 1 & \text{if } m - |N(0,c)| \le 1 \end{cases} \qquad (12)$$
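The two shape mutations (offspring 2 and 3) share a single random draw, which the following sketch makes explicit. It assumes that the perturbation parameter `c` is a standard deviation and that the upward mutation is clipped at the upper constant (200 in this investigation); both points, and the function name `mutate_shape`, are assumptions made for illustration.

```python
import random

def mutate_shape(m, c, m_upper=200.0, rng=random):
    """Return the (increased, decreased) shape parameters from one shared normal draw."""
    step = abs(rng.gauss(0.0, c))      # magnitude of the normal perturbation
    increased = min(m + step, m_upper)  # offspring 2: clipped at the upper constant
    decreased = max(m - step, 1.0)      # offspring 3: never below the normal case m = 1
    return increased, decreased

up, down = mutate_shape(1.5, 5.0, rng=random.Random(1))
```

Using the same draw in both directions gives the competition a symmetric pair of candidates around the parent's shape, so selection alone decides whether a flatter or more peaked component fits better.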


After the offspring models have been generated and the EM algorithm
has been applied for several iterations to each offspring, every model
in the combined population (parents and offspring) is passed through a
single iteration of the EM algorithm. In each model, the EM algorithm
optimizes the mean and covariance parameters, as well as the component
proportion values, according to equations (5) and (6). The fitness of
each model is then computed using the MDL (7), the best N models (from
the population of 4N models) are kept, and the process is repeated.
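The overall loop can be summarized in a short skeleton. It is a sketch under stated assumptions: `mutate`, `em_step`, and `score` are hypothetical callables supplied by the caller (standing in for the three mutation rules, one EM iteration, and the MDL score), and `warmup_steps` stands in for the unspecified number of EM iterations granted to each offspring. Only the fixed-iteration stopping rule is shown.

```python
def hybrid_search(parents, mutate, em_step, score, n_iter, warmup_steps=5):
    """Stochastic mutation plus deterministic EM refinement and deterministic selection."""
    n = len(parents)
    population = list(parents)
    for _ in range(n_iter):
        offspring = []
        for parent in population:
            for child in mutate(parent):          # three offspring per parent in the text
                for _ in range(warmup_steps):     # EM head start before competition
                    child = em_step(child)
                offspring.append(child)
        # Every model, parent or offspring, receives one further EM iteration.
        combined = [em_step(model) for model in population + offspring]
        combined.sort(key=score)                  # lower score (MDL) is better
        population = combined[:n]                 # keep the best N models
    return population
```

A stagnation test on the best score could replace or complement the fixed iteration count, matching the second stopping condition described earlier.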


4  MIXTURE  MODEL  OPTIMIZATION  CHARACTERISTICS 

To  test  the  capability  of  the  optimization  process,  a  two-dimensional 
classification  problem  is  posed.  Two-dimensional  problems  aid  in 
visualization  and  interpretation.  Barring  sample  size  issues,  the 
generalized  mixture  technique  is  directly  applicable  to  higher 


dimensional  problems.  In  the  experiment,  training  samples  are  created 
from  each  class's  underlying  distribution.  These  samples  are  used  to 
derive  generalized  mixture  models  for  each  class.  For  comparison 
purposes,  optimal  (in  the  MDL  sense)  normal-component  models  are  also 
derived  for  the  data. 

Flick  data 

To  test  the  capability  of  the  density  estimation  process,  and  to  assess 
the  classification  capability  of  the  generalized  mixture  model 
approach,  two  overlapping,  piecewise-continuous  distributions 
previously described by Flick et al. (1990) (herein labeled as classes 1
and  2)  are  estimated  via  the  mixture  distributions  discussed  in  section  3. 
The  probability  density  functions  for  class  1  [class  2]  are  defined  on  the 
unit  square  as  the  following: 


$$f(x, y) = \begin{cases} 3.0 \;\; [0.0] & 0 < y < 0.25 \\ 0.86 \;\; [0.14] & 0.25 < y < b(x) \\ 0.14 \;\; [0.86] & b(x) < y < 0.75 \\ 0.0 \;\; [3.0] & 0.75 < y < 1.0 \end{cases} \qquad (13)$$

where $b(x) = 0.5 - 0.25\cos(2\pi x)$. Figure 4 graphically displays the
underlying probability density function for each class. These
distributions overlap, so that the optimal classification rule, based on
knowledge of the true underlying probability distribution of each class,
will misclassify (on average) 3.5% of the new data samples presented for
classification (3.5% of each class).
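The stated 3.5% error of the optimal rule can be checked numerically. The sketch below assumes equal class priors and the piecewise-constant densities of the Flick problem, with the class 2 density taken to be 3.0 on the top strip (a value inferred here from the requirement that the density integrate to one). The function names are hypothetical.

```python
import numpy as np

def boundary(x):
    """Curved boundary b(x) separating the two overlapping regions."""
    return 0.5 - 0.25 * np.cos(2.0 * np.pi * x)

def class_densities(x, y):
    """Densities of class 1 and class 2 on the unit square."""
    b = boundary(x)
    conditions = [y < 0.25, y < b, y < 0.75]   # evaluated in order, first match wins
    f1 = np.select(conditions, [3.0, 0.86, 0.14], default=0.0)
    f2 = np.select(conditions, [0.0, 0.14, 0.86], default=3.0)
    return f1, f2

# Midpoint grid over the unit square; cell averages approximate integrals.
grid = (np.arange(400) + 0.5) / 400.0
xx, yy = np.meshgrid(grid, grid)
f1, f2 = class_densities(xx, yy)
mass1, mass2 = f1.mean(), f2.mean()                 # both should be close to 1
bayes_error = 0.5 * np.minimum(f1, f2).mean()       # equal priors of 0.5
```

The overlap band between y = 0.25 and y = 0.75 always carries the smaller density value 0.14, so the error is 0.5 x 0.14 x 0.5 = 0.035 regardless of the boundary shape, in agreement with the 3.5% figure.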




Figure  4.  Probability  density  functions  for  classes  1  (top)  and  2  (bottom). 

The  training  sets  consist  of  200  samples  of  each  class.  The  training 
sample  for  each  class  is  shown  in  Figure  5.  For  the  model  optimization 
process,  each  class  was  optimized  separately,  using  the  MDL  as  the 
measure  of  model  fitness.  The  model  population  number  N  (of  Figure  3) 
was  arbitrarily  set  to  5. 


Figure  5.  Training  sets  of  the  two-class  problem.  Each  sample  consists  of  200 
points. 

For  both  classes,  the  algorithm  converged  to  a  solution  within  50 
iterations.  The  value  of  the  MDL  for  the  "best”  mixture  model  and  its 
corresponding  number  of  components  at  each  iteration  are  shown  in 
Figure  6.  Table  1  and  Figure  7  display  the  resulting  mixture  models 


derived  by  the  algorithm  from  the  sample  data.  The  results 
demonstrate  that  the  algorithm  quickly  minimizes  the  number  of 
components  in  its  optimization  process,  with  both  data  samples  being 
optimally  modeled  by  two-component  mixture  models. 




Figure  6.  Best  model’s  MDL  score  (a)  and  number  of  components  (b).  Model 
optimization  occurs  quickly,  with  both  density  models  optimized  within  the  first 
20  iterations. 


Table 1. Final generalized mixture model estimates for each class derived via
stochastic-deterministic optimization. Each model was derived from 200 data
points.

Figure 7. Generalized mixture models (class 1 (a), class 2 (b)) generated by
stochastic-deterministic optimization.


Normal  component  mixture  model  comparison 

To  investigate  its  utility,  the  generalized  mixture  approach  is  compared 
with  the  traditional  mixture  approach.  The  traditional  mixture  consists 
of  normally  distributed  components.  Computing  models  using  normal 


components  illuminates  the  representation  and  classification  capability 
of  the  generalized  mixture  models. 

Optimal (in terms of the mixture's MDL value for the training data
set) normal component mixture density estimates were computed from
the training data. Figure 8 displays the normal mixture models of each
class.

The densities generated via the two approaches are given in Table 2.
As shown in the table, the generalized mixture approach produces
superior models in terms of the log-likelihood function, the total number
of free parameters used for data representation, and the MDL criterion.
This superiority is due to the added flexibility of component shape
modification. The next section compares the classification capability of
these density models.


Figure  8.  Optimal  normal  component  mixture  models  (class  1  (a),  class  2  (b)). 


Table  2.  Comparison  of  fitness  characteristics  of  generalized  and  normal 
mixture  models  derived  from  the  training  data.  The  models  are  optimal  with 
respect  to  the  training  data  and  the  criterion  function,  the  MDL  (Eq.  7). 


Feature                      Generalized Mixture       Normal Mixture
                             Class 1     Class 2       Class 1     Class 2

Number of components         2           2             2           3
Log-likelihood function      115.585     95.274        106.177     94.270
Number of free parameters    13          13            11          17
MDL score                    -81.5104    -60.8346      -77.0359    -49.2344

5  CLASSIFICATION  RESULTS 


The  mixture  models  produced  are  probability  density  functions,  so  it  is 
natural  to  use  Bayes  rule  as  the  decision  rule  for  classification  of  new 


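data. Given a set of $c$ possible classes $\omega_1, \ldots, \omega_c$, Bayes rule is given as the
following:

$$P(\omega_i \mid x, \theta_i) = \frac{f(x \mid \omega_i, \theta_i)\, p(\omega_i)}{\sum_{j=1}^{c} f(x \mid \omega_j, \theta_j)\, p(\omega_j)} \qquad (14)$$

where $P(\omega_i \mid x, \theta_i)$ is the posterior probability that a new sample $x$ is
associated with class $\omega_i$. A new data sample is assigned to the class with
the largest posterior probability. With the assumption that the prior
probabilities $p(\omega_i)$ of each class are equal (an assumption we make for
this two-class example case), the classification rule can be written as

$$\text{Assign } x \text{ to class } \begin{cases} \omega_1 & \text{if } f(x \mid \omega_1, \theta_1) > f(x \mid \omega_2, \theta_2) \\ \omega_2 & \text{otherwise} \end{cases} \qquad (15)$$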
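The decision procedure can be sketched in a few lines. The sketch assumes each class density is available as a callable and defaults to equal priors, as in the two-class example; the function names and the toy densities are hypothetical.

```python
def posteriors(x, class_densities, priors=None):
    """Posterior probability of each class for sample x via Bayes rule."""
    if priors is None:
        priors = [1.0 / len(class_densities)] * len(class_densities)  # equal priors
    joint = [f(x) * p for f, p in zip(class_densities, priors)]
    total = sum(joint)
    if total == 0.0:
        raise ValueError("sample has zero density under every class")
    return [j / total for j in joint]

def classify(x, class_densities, priors=None):
    """Index of the class with the largest posterior probability."""
    post = posteriors(x, class_densities, priors)
    return max(range(len(post)), key=lambda i: post[i])

# Toy densities on [0, 1]: class 0 favors large x, class 1 favors small x.
toy = [lambda x: 2.0 * x, lambda x: 2.0 * (1.0 - x)]
toy_posteriors = posteriors(0.8, toy)
toy_label = classify(0.8, toy)
```

With equal priors the comparison of posteriors reduces to a comparison of the class-conditional densities, which is the simplification used in the two-class rule.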


The optimal discriminant boundary for the underlying distributions,
given the decision rule of equation (15), is displayed in Figure 9. To
compare the discriminant boundary produced by the mixture distribution
estimates and the classification rule with the optimal decision
boundary, a plot of the decision surface generated by the mixture
estimates was computed. The mixture model decision surface is shown in
Figure 10.

Given the small number of training samples (200 of each class here,
whereas 1000 samples of each class were used for training by Flick et
al.), the mixture estimates do well in characterizing the decision
surface of the underlying distribution. The optimized generalized
mixtures are next compared with the normal component mixture models in
terms of their classification capability.


Figure  9.  Optimal  decision  boundary  for  underlying  class  distributions. 


Classification comparison of the normal component mixture model
estimates

To compare the classification characteristics of the two mixture model
paradigms, six sets of 5000 samples from each class were randomly
generated and classified according to equation (15). The classification
performance of the generalized and normal mixture model estimates on
these data sets is given in Table 3.


Table 3. Classification performance of generalized and normal mixture
models on Flick test data. Numbers correspond to the number of correctly
classified samples from a test sample of 10,000 points (5,000 points from each
class).


Data Set    Generalized Mixtures    Normal Mixtures

1           9440                    9481
2           9456                    9487
3           9425                    9466
4           9418                    9448
5           9412                    9428
6           9452                    9464

As  noted  in  the  previous  section,  the  theoretically  optimal  classifier 
will  on  average  correctly  classify  96.50%  of  the  samples  presented,  due 
to  the  overlap  of  the  class  distributions.  The  table  demonstrates  that 
both  mixture  model  paradigms  produced  excellent  results,  with  the 
generalized  component  mixture  models  correctly  classifying  94.34%  of 
the  test  samples  versus  94.62%  for  the  normal  component  mixture  models. 
To  test  if  the  classification  difference  in  the  mixture  approaches  is 
statistically  significant,  a  Wilcoxon  paired  sample  test  was  applied. 
For the data in Table 3, a Wilcoxon statistic returns a p-value of 0.05.
Therefore a statistically significant difference is detected (at the
$\alpha = 0.05$ level of significance) in the classification performance of the
two  models.  Note,  however,  that  this  difference  is  on  average  only 
0.28%,  and  as  demonstrated  in  the  statistics  of  Table  2,  the  generalized 
mixture  models  are  more  parsimonious  (i.e.,  they  require  fewer  free 
parameters  to  represent  the  two  classes). 
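The summary percentages quoted above follow directly from the counts in Table 3, as the short calculation below shows. The variable names are illustrative; the counts are copied from the table.

```python
# Correctly classified samples per test set (10,000 points each), from Table 3.
generalized_counts = [9440, 9456, 9425, 9418, 9412, 9452]
normal_counts = [9481, 9487, 9466, 9448, 9428, 9464]
points_per_set = 10000

def mean_accuracy_percent(counts, set_size):
    """Average percentage of correctly classified samples over all test sets."""
    return 100.0 * sum(counts) / (set_size * len(counts))

generalized_acc = mean_accuracy_percent(generalized_counts, points_per_set)
normal_acc = mean_accuracy_percent(normal_counts, points_per_set)
accuracy_gap = normal_acc - generalized_acc

# Number of test sets on which the normal mixtures scored higher.
normal_wins = sum(n > g for g, n in zip(generalized_counts, normal_counts))
```

The normal mixtures score higher on all six test sets, which is why a paired test flags the difference even though the average gap is below 0.3 percentage points.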


6  CONCLUSIONS 

Mixture  models  provide  an  excellent  nonparametric  approach  for 
density  estimation  and  pattern  classification.  The  algorithm  described 
by  this  paper  frees  a  mixture  method  practitioner  from  having  to  make 
an  a  priori  estimate  of  the  number  of  components  required  by  a  mixture 
for  optimal  representation,  and  allows  the  practitioner  to  use  shape- 
parameterized  distributions  as  the  functional  basis  of  the  mixture 
components. 


This  work  has  demonstrated  that  the  combination  of  a  multiagent 
stochastic  search  technique  and  the  EM  algorithm  can  produce  sound 
probability  density  estimates  for  multivariate  data.  The  mixture 
distributions  produced  by  the  multiagent  optimization  process  display 
promising  classification  capabilities.  Although  the  study  is  too  limited 
for  statements  concerning  the  general  classification  capabilities  of  the 
algorithm  and  the  mixture  estimates  it  produces,  this  work 
demonstrates  that  elegant  classifier  systems  can  be  produced  from 
algorithmically  optimized  mixture  models. 